from coppy.rng import base, evd, archimedean
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')

The modeling of dependence between random variables is an important subject in several applied fields of science. To this aim, the copula function can be used as a margin-free description of the dependence structure. Several copulae belong to specific families such as the Archimedean, elliptical or extreme value families. While software implementations of copulae have been thoroughly explored in \textsf{R}, methods to work with copulae in \textsf{Python} are still in their infancy. To promote dependence modeling with copulae in \textsf{Python}, we have developed COPPY, a library that provides random vector generation for a range of copulae.
Keywords: random number generation, copulas
Modeling dependence relations between random variables is a topic of interest in probability theory and statistics. The most popular approach is based on the second moment of the underlying random variables, namely the covariance. It is well known that the covariance captures only linear dependence and characterizes the dependence structure for only a few models, such as the multivariate normal distribution. As a beneficial alternative notion of dependence, the concept of copulae goes back to Sklar (1959). The copula C : [0,1]^d \rightarrow [0,1] of a random vector \textbf{X} = (X_0, \dots, X_{d-1}) with d \geq 2 allows us to separate the effect of dependence from the effect of the marginal distributions via:
\mathbb{P}\left\{ X_0 \leq x_0, \dots, X_{d-1} \leq x_{d-1} \right\} = C\left(\mathbb{P} \{X_0 \leq x_0\}, \dots, \mathbb{P}\{X_{d-1} \leq x_{d-1} \}\right),
where (x_0, \dots, x_{d-1}) \in \mathbb{R}^d. The main consequence of this identity is that the copula completely characterizes the stochastic dependence between the margins of \textbf{X}.
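To make this identity concrete, here is a small numerical illustration (a sketch, not part of COPPY): a Gaussian copula with correlation \rho = 0.6 paired with exponential margins, comparing an empirical joint CDF with the copula evaluated at the marginal CDFs. All parameter values here are arbitrary choices for the illustration.

```python
# Numerical illustration of Sklar's identity: Gaussian copula with
# correlation rho, exponential margins imposed through the quantile function.
import numpy as np
from scipy.stats import norm, expon, multivariate_normal

rng = np.random.default_rng(0)
rho = 0.6
cov = [[1.0, rho], [rho, 1.0]]

# Sample from the Gaussian copula: push a bivariate normal through norm.cdf,
# then impose exponential margins with expon.ppf.
z = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
u = norm.cdf(z)            # uniform margins, Gaussian dependence structure
x = expon.ppf(u)           # exponential margins, same dependence structure

x0, x1 = 1.0, 2.0
lhs = np.mean((x[:, 0] <= x0) & (x[:, 1] <= x1))   # empirical joint CDF
# C(u0, u1) = Phi_rho(Phi^{-1}(u0), Phi^{-1}(u1)) for the Gaussian copula.
mvn = multivariate_normal(mean=[0.0, 0.0], cov=cov)
rhs = mvn.cdf([norm.ppf(expon.cdf(x0)), norm.ppf(expon.cdf(x1))])
print(abs(lhs - rhs) < 0.01)   # both sides agree up to Monte Carlo error
```

The joint probability on the left involves the margins only through their CDF values, which is exactly the separation stated by the identity.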
In other words, copulae allow one to model marginal distributions and the dependence structure separately. Furthermore, motivated by Sklar's theorem, the problem of investigating stochastic dependence is reduced to the problem of investigating multivariate distribution functions on the unit hypercube [0,1]^d with uniform margins. The theory of copulae has been of prime interest in many applied fields of science, such as quantitative finance, actuarial science or environmental sciences. This increasing number of applications has led to a demand for statistical methods. To name a few, semiparametric estimation was studied by C. Genest, Ghoudi, and Rivest (1995) and nonparametric estimation of copulae by Gijbels, Omelka, and Veraverbeke (2015), while Portier and Segers (2018) analyse the weak convergence of the nonparametric estimator of conditional copulae. Such results are established for an arbitrary fixed dimension d \geq 2, but several investigations (see Einmahl and Lin (2006), Einmahl and Segers (2021)) treat functional data through the tail copula, which captures the dependence in the upper tail. The empirical version of this function is shown to converge weakly to a Gaussian process which is similar in structure to the weak limit of the empirical copula process.
While software implementations of copulae have been thoroughly explored in \textsf{R} (see, e.g., A. G. Stephenson (2002), Jun Yan (2007), Schepsmeier et al. (2019)), methods to work with copulae in \textsf{Python} are still in their infancy. As far as we know, modules devoted to copulae in \textsf{Python} are mainly designed for modeling (see Alvarez et al. (2021), Bock and Chapman (2021)). These modules use maximum likelihood methods to estimate the copula parametrically from observed data and propose to sample synthetic data from the estimated copula model. Nevertheless, they do not allow generating random numbers from a given copula model where the user only has to specify the desired parameter. Furthermore, only the Archimedean and elliptical families are covered, and the multivariate case has barely been touched. Here, we propose to implement a wide range of copulae, including the extreme value class, and, where possible, in arbitrary fixed dimension d \geq 2.
Throughout this article we adopt the following notational conventions: all indices start at 0, as in \textsf{Python}. Consider a probability space (\Omega, \mathcal{A}, \mathbb{P}) and let \textbf{X} = (X_0, \dots, X_{d-1}) be a d-dimensional random vector with values in (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d)), with d \geq 2. This random vector has joint distribution F with copula C, and its margins are denoted by F_j(x) = \mathbb{P}\{X_j \leq x\} for all x \in \mathbb{R} and j \in \{0, \dots, d-1\}. Denote by \textbf{U} = (U_0, \dots, U_{d-1}) a d-dimensional random vector with copula C and uniform margins. Bold letters such as \textbf{x} denote vectors.
The module COPPY is implemented using the object-oriented features of the \textsf{Python} language. The classes are designed for Archimedean, elliptical and extreme value copulae. Section 2 briefly presents the classes defined in the module. Section 3 describes methods to generate random vectors. Section 4 presents an application of COPPY to the modeling of pairwise dependence between maxima. Section 5 discusses some further improvements of the module and concludes. Section 6.1 - Section 6.5 are devoted to defining and illustrating all the parametric copula models implemented in the module.
Three main classes are defined in the COPPY module: the Multivariate, Archimedean and Extreme classes (see Figure 1 for a description of the architecture of the code). The Multivariate class is designed to define multivariate copulae (including the bivariate case) and is located at the highest level of the code architecture. As a generic class, it contains methods to instantiate the copula object or to sample from a copula with desired margins, using inversion methods for example. At a second level, we find its children, namely Archimedean, Extreme, Gaussian and Student. These correspond to different families of copulae: Archimedean and extreme value copulae for Archimedean and Extreme, and elliptical copulae for Gaussian and Student. This partition is relevant theoretically, as Archimedean and extreme value copulae are studied in their own right (see Charpentier and Segers (2009) or Christian Genest and Segers (2009) to name a few), but also in practice, as these classes contain effective sampling methods. The Gaussian and Student classes are kept separate because the most effective sampling algorithms are specific to each and cannot be generalized to a broader elliptical class. At the third level of the architecture, we find some important Archimedean and extreme value parametric copula models (depicted in blue in Figure 1). These parametric models contain methods such as the generator function \varphi (see Section 2.1) for the Archimedean case and the Pickands dependence function A (see Section 2.2) for the extreme value case (in green in Figure 1). We recall in Section 2.1 the definition of Archimedean copulae and some of their properties in high-dimensional spaces. A characterization of extreme value copulae is given in Section 2.2.
For each section, the reader is referred to the appendix to discover the wide range of copulae implemented in the module: Section 6.1 and Section 6.3 for Archimedean copulae, Section 6.2 and Section 6.4 for extreme value copulae and, lastly, Section 6.5 for elliptical copulae.
Let \varphi be a generator, that is, a strictly decreasing, convex function from [0,1] to [0, \infty] such that \varphi(1) = 0 and \varphi(0) = \infty. We denote by \varphi^\leftarrow the generalized inverse of \varphi. Define
C(\textbf{u}) = \varphi^\leftarrow (\varphi(u_0)+ \dots + \varphi(u_{d-1})). \qquad(1)
If this relation holds and C is a copula function, then C is called an Archimedean copula. A characterization for Equation 1 to define a copula is that the inverse generator is d-monotone, that is, \varphi^\leftarrow is differentiable up to order d and its derivatives satisfy
(-1)^k \left(\varphi^\leftarrow\right)^{(k)}(x) \geq 0, \quad k \in \{1, \dots, d\} \qquad(2)
for x \in (0, \infty) (see Corollary 2.1 of McNeil and Nešlehová (2009)). It is particularly interesting to emphasize that d-monotone Archimedean inverse generators do not necessarily generate Archimedean copulae in dimensions higher than d. For that reason, some Archimedean subclasses are only implemented for the bivariate case, as in greater dimensions they do not generate an Archimedean copula. In the bivariate case, it is worth noticing that Equation 2 can be interpreted as \varphi^\leftarrow being a convex function.
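As an illustration (a sketch, not the module's own code), Equation 1 can be checked numerically for the Clayton generator \varphi(t) = (t^{-\theta}-1)/\theta, whose inverse is \varphi^\leftarrow(s) = (1+\theta s)^{-1/\theta}; the parameter value below is arbitrary.

```python
# Building an Archimedean copula from its generator (Equation 1),
# illustrated with the Clayton generator phi(t) = (t**-theta - 1)/theta.
import numpy as np

theta = 2.0  # illustrative parameter value
phi = lambda t: (t ** (-theta) - 1.0) / theta
phi_inv = lambda s: (1.0 + theta * s) ** (-1.0 / theta)

def C(u):
    """Archimedean copula built from the generator, as in Equation 1."""
    return phi_inv(np.sum(phi(np.asarray(u))))

# The construction matches the closed form of the Clayton copula ...
u = [0.3, 0.7]
closed_form = (0.3 ** -theta + 0.7 ** -theta - 1.0) ** (-1.0 / theta)
print(np.isclose(C(u), closed_form))      # True
# ... and satisfies the boundary condition C(u, 1) = u since phi(1) = 0.
print(np.isclose(C([0.4, 1.0]), 0.4))     # True
```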
The Archimedean copula classes implemented in the module are commonly used one-parameter families, such as the Clayton (Clayton (1978)), Gumbel (Gumbel (1960)), Joe (Joe (1997)), Frank (Frank (1979)) and AMH (Ali, Mikhail, and Haq (1978)) copulae for the multivariate case. Remark that Archimedean copulae are always symmetric, and in dimension 3 or higher only positive association is allowed. For the specific bivariate case, other families, such as numbers 4.2.9 to 4.2.15 and 4.2.22 of Section 4.2 of Nelsen (2007), are implemented. We refer the reader to Section 6.1 and Section 6.3 for definitions and illustrations of these parametric copula models.
Investigating the notion of copulae within the framework of multivariate extreme value theory leads to the so-called extreme value copulae (see Gudendorf and Segers (2010) for an overview), defined as C(\textbf{u}) = \exp \left( - \ell(-\ln(u_0), \dots, -\ln(u_{d-1})) \right), \quad \textbf{u} \in (0,1]^d, \qquad(3) with \ell : [0,\infty)^d \rightarrow [0,\infty) the stable tail dependence function, which is convex, homogeneous of order one, namely \ell(c\textbf{x}) = c \ell(\textbf{x}) for c > 0, and satisfies \max(x_0,\dots,x_{d-1}) \leq \ell(x_0,\dots,x_{d-1}) \leq x_0+\dots+x_{d-1}, \forall \textbf{x} \in [0,\infty)^d. Denote by \Delta^{d-1} = \{\textbf{w} \in [0,1]^d : w_0 + \dots + w_{d-1} = 1\} the unit simplex. By homogeneity, \ell is characterized by the Pickands dependence function A : \Delta^{d-1} \rightarrow [1/d,1], which is the restriction of \ell to the unit simplex \Delta^{d-1} : \ell(x_0, \dots,x_{d-1}) = (x_0 + \dots + x_{d-1}) A(w_0, \dots, w_{d-1}), \quad w_j = \frac{x_j}{x_0 + \dots + x_{d-1}}, \qquad(4) for j \in \{1,\dots,d-1\} and w_0 = 1 - w_1 - \dots - w_{d-1}, with \textbf{x} \in [0, \infty)^d \setminus \{\textbf{0}\}.
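These defining properties can be verified numerically on a concrete model. The sketch below (illustrative, with an arbitrary parameter) uses the logistic stable tail dependence function \ell(\textbf{x}) = (x_0^\theta + \dots + x_{d-1}^\theta)^{1/\theta}, \theta \geq 1:

```python
# Checking homogeneity, the bounds on l, and the range of the Pickands
# function for the logistic stable tail dependence function.
import numpy as np

theta = 3.0  # illustrative parameter, theta >= 1
l = lambda x: np.sum(np.asarray(x) ** theta) ** (1.0 / theta)

x = np.array([0.2, 0.5, 0.3])
# Homogeneity of order one: l(c x) = c l(x).
print(np.isclose(l(2.0 * x), 2.0 * l(x)))        # True
# Bounds: max(x) <= l(x) <= sum(x).
print(x.max() <= l(x) <= x.sum())                # True
# Pickands function: A(w) = l(w) on the simplex, with values in [1/d, 1].
w = x / x.sum()
A = l(w)
print(1.0 / len(w) <= A <= 1.0)                  # True
```

The lower bound A = 1/d corresponds to complete dependence (\theta \to \infty gives \ell = \max) and the upper bound A = 1 to independence (\theta = 1 gives \ell the sum).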
From a practical point of view, the family of extreme value copulae is very rich and arises naturally as the limiting distribution of properly normalised component-wise maxima. Furthermore, it contains a rich variety of parametric models and allows asymmetric dependence. For the multivariate framework, the logistic copula (or Gumbel, see Gumbel (1960)) and the asymmetric logistic copula (Tawn (1990)) are implemented. We emphasize here that the logistic copula is the sole model that is both Archimedean and extreme value. Bivariate extreme value copulae included in the library are the asymmetric logistic (Tawn (1988)), asymmetric negative logistic (Joe (1990)), asymmetric mixed (Tawn (1988)), Hüsler-Reiss (Hüsler and Reiss (1989)), t-EV (Demarta and McNeil (2005)), bilogistic (Smith (1990)) and Dirichlet (Boldi and Davison (2007)) models. The reader is again referred to Section 6.2 and Section 6.4 for precise definitions of these models.
We propose a \textsf{Python}-based implementation for generating random numbers from a wide variety of copulae. COPPY requires a few external packages in order to work; they are freely available online.
COPPY depends on NumPy, version 1.6.1 or newer, the fundamental package for scientific computing, which contains linear algebra functions and matrix/vector objects (Van Der Walt, Colbert, and Varoquaux (2011)), and on SciPy, version 1.7.1 or newer, a library of open-source software for mathematics, science and engineering (Jones, Oliphant, and Peterson (2001)).

The random vector generator methods in the code are sample_unimargin and sample of the Multivariate class. The first generates a sample whose margins are uniformly distributed on the unit segment [0,1], while the second generates a sample with the chosen margins. In Section 3.1, we present an algorithm used to sample from a copula using the conditioning method. This method is very general and may be used for every copula that is sufficiently smooth (see Equation 6, Equation 7 and Equation 8 below). However, the practical infeasibility of the algorithm in dimension higher than 2, together with the computationally intensive numerical inversion, calls for efficient ways to sample in greater dimensions. The purpose of Section 3.2 is to present those methods and to detail the ones used in the module. In each section, typical lines of code are presented to instantiate a copula and to sample with COPPY.
In the sequel, all \textsf{Python} code demonstration assumes that the modules have been loaded :
from coppy.rng import base, evd, archimedean
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')

In this subsection, we address the problem of generating a bivariate sample from a specified joint distribution with d = 2. Suppose that we want to sample a bivariate random vector \textbf{X} with copula C. If the components are independent, the procedure is straightforward, as we can sample X_0 on the one hand and X_1 on the other. For a general copula, independence is only a special case.
One procedure to generate a pair (u_0,u_1) from \textbf{U} is the conditional distribution method. This method is convenient if the law of U_1 given U_0 can be easily sampled. An interesting property of copulae is the link between the conditional distribution and the derivative of the copula function (see Nelsen (2007), page 41), namely
c_{u_0}(u_1) \triangleq \mathbb{P}\left\{ U_1 \leq u_1 | U_0 = u_0 \right\} = \frac{\partial C(u_0,u_1)}{\partial u_0}. \qquad(5)
Thus, a convenient algorithm to simulate a bivariate copula, as long as the copula admits a first partial derivative with respect to its first component, is given by Algorithm 1.
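An illustrative standalone sketch of this conditional distribution method for the bivariate Clayton copula follows (using the closed-form conditional CDF and a SciPy root finder; this is not the module's own implementation, and the parameter value is arbitrary):

```python
# Conditional distribution method (Algorithm 1) for the bivariate Clayton
# copula: draw (u0, t) uniformly, then solve c_{u0}(u1) = t for u1.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import kendalltau

theta = 2.0  # illustrative Clayton parameter

def cond_cdf(u1, u0):
    """c_{u0}(u1) = dC/du0 for the bivariate Clayton copula (closed form)."""
    return u0 ** (-theta - 1.0) * (
        u0 ** -theta + u1 ** -theta - 1.0) ** (-1.0 / theta - 1.0)

def cond_sim(n, rng):
    out = np.empty((n, 2))
    for i in range(n):
        u0, t = rng.uniform(size=2)
        # Invert the conditional distribution numerically on ]0, 1[.
        out[i] = u0, brentq(lambda v: cond_cdf(v, u0) - t, 1e-12, 1.0 - 1e-12)
    return out

sample = cond_sim(3000, np.random.default_rng(1))
# Margins should be close to uniform, and Kendall's tau close to the
# theoretical value theta/(theta + 2) = 0.5 for Clayton.
tau = kendalltau(sample[:, 0], sample[:, 1])[0]
print(abs(sample[:, 0].mean() - 0.5) < 0.05, abs(tau - 0.5) < 0.05)
```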
For the sixth step of Algorithm 1, it is equivalent to find u_1 \in [0,1] such that c_{u_0}(u_1) - t_1 = 0 holds. Such a u_1 always exists because for every u \in ]0,1[ we have 0 \leq c_{u_0}(u) \leq 1 and u \mapsto c_{u_0}(u) is an increasing function (see Theorem 2.2.7 of Nelsen (2007) for a proof). This step is solved using a root-finding function from the SciPy module. A sufficient condition for C to admit a first partial derivative in the Archimedean and extreme value cases is that the generator \varphi and the Pickands dependence function A are continuously differentiable on ]0,1[. Then the first partial derivative of C is given, respectively, by:
\frac{\partial C}{\partial u_0}(u_0,u_1) = \frac{\varphi'(u_0)}{\varphi'(C(u_0,u_1))}, \quad (u_0,u_1) \in ]0,1[^2, \qquad(6)
\frac{\partial C}{\partial u_0}(u_0,u_1) = \frac{C(u_0,u_1)}{u_0} \mu(t), \quad (u_0,u_1) \in ]0,1[^2, \qquad(7)
where t = log(u_1) / log(u_0u_1) \in (0,1) and \mu(t) = A(t) - tA'(t).
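This extreme value partial derivative, \partial C/\partial u_0 = C(u_0,u_1)\mu(t)/u_0 with \mu(t) = A(t) - tA'(t) and t as above, can be cross-checked against a finite difference. The sketch below (illustrative, arbitrary parameter) uses the logistic Pickands function A(t) = (t^\theta + (1-t)^\theta)^{1/\theta}:

```python
# Finite-difference check of dC/du0 = C(u0,u1) * mu(t) / u0 for the
# logistic model A(t) = (t**theta + (1-t)**theta)**(1/theta).
import numpy as np

theta = 2.5  # illustrative parameter
A  = lambda t: (t ** theta + (1 - t) ** theta) ** (1 / theta)
dA = lambda t: (t ** theta + (1 - t) ** theta) ** (1 / theta - 1) * (
    t ** (theta - 1) - (1 - t) ** (theta - 1))

def C(u0, u1):
    """Bivariate extreme value copula written through A (Equation 3)."""
    s = np.log(u0 * u1)
    t = np.log(u1) / s
    return np.exp(s * A(t))

u0, u1 = 0.4, 0.7
t = np.log(u1) / np.log(u0 * u1)
analytic = C(u0, u1) * (A(t) - t * dA(t)) / u0
h = 1e-6
numeric = (C(u0 + h, u1) - C(u0 - h, u1)) / (2 * h)
print(abs(analytic - numeric) < 1e-6)   # the two derivatives agree
```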
We now have all the necessary theoretical tools to give details on how the COPPY module is designed. The file base.py contains the Multivariate class, whose sample method samples from \textbf{X} with copula C. To do so, we use the inversion method, that is, we sample from \textbf{U} using Algorithm 1 and compose the corresponding uniform margins with F_j^\leftarrow. Equation 5 indicates that the sole knowledge of A or \varphi and their respective derivatives is needed in order to perform the sixth step of Algorithm 1. For that purpose, Algorithm 1 is named cond_sim in the code and is located inside the Archimedean and Extreme classes. Each child of the bivariate Archimedean (resp. Extreme) class is thus defined by its generator \varphi (resp. A), its derivative \varphi' (resp. A') and its inverse \varphi^\leftarrow, as emphasized in green in Figure 1. Namely, we perform Algorithm 1 for the Archimedean subclasses Frank, AMH, Clayton (when \theta < 0 for the previous three), Nelsen_9, Nelsen_10, Nelsen_11, Nelsen_12, Nelsen_13, Nelsen_14, Nelsen_15 and Nelsen_22. For the Extreme class, this algorithm is performed for the Asy_neg_log and Asy_mixed models. For the other models, faster algorithms are known and thus implemented; we refer to Section 3.2 for details.
The following code illustrates random vector generation for a bivariate Archimedean copula. Given the parameter of the copula and the sample length, the constructor of the copula may be called, e.g. via Clayton(), as follows:
n_sample, theta = 1024, -0.5
copula = archimedean.Clayton(theta = theta, n_sample = n_sample)

To obtain a sample with uniform margins and a Clayton copula, we make use of the \texttt{sample\_unimargin} method:
sample = copula.sample_unimargin()

Here, the \texttt{sample} object is a \texttt{NumPy} array with 2 columns and 1024 rows, where each row contains a realization from a Clayton copula.
seagreen = sns.light_palette("seagreen", as_cmap = True)
fig, ax = plt.subplots()
ax.scatter(sample[:,0], sample[:,1], color = 'white', edgecolor = seagreen(0.25),s = 5)
ax.set_xlabel(r'$u_0$')
ax.set_ylabel(r'$u_1$')
plt.show()

We now address the simulation of the multivariate Archimedean and extreme value copulae proposed in the COPPY module. In the multivariate case, the link between partial derivatives and conditioning remains. Indeed, let (U_0, \dots, U_{d-1}) be a d-dimensional random vector with uniform margins and copula C; the conditional distribution of U_k given the values of U_0, \dots, U_{k-1} is
\mathbb{P}\left\{ U_k \leq u_k | U_0 = u_0, \dots, U_{k-1} = u_{k-1} \right\} = \frac{\partial^{k-1} C(u_0, \dots, u_k,1,\dots,1)/\partial u_0 \dots \partial u_{k-1}}{\partial^{k-1} C(u_0, \dots, u_{k-1},1,\dots,1) / \partial u_0 \dots \partial u_{k-1}} \qquad(8)
for k \in \{1,\dots, d-1\}. The conditional simulation algorithm may be written as follows.
Nevertheless, evaluating the inverse conditional distribution becomes increasingly complicated as the dimension d increases. Furthermore, for some models it can be difficult to derive a closed form of Equation 8, which makes it impossible to implement in a general algorithm where only the dimension d is an input. For multivariate Archimedean copulae, McNeil and Nešlehová (2009) give a method to generate a random vector from the d-dimensional copula C with generator \varphi (see Section 5.2 of loc. cit.). A stochastic representation for Archimedean copulae generated by a d-monotone generator is given by
\textbf{U} = \left( \varphi^\leftarrow(R S_1), \dots, \varphi^\leftarrow(RS_d) \right) \sim C, \qquad(9)
where R \sim F_R is the radial distribution, independent of \textbf{S}, and \textbf{S} is distributed uniformly on the unit simplex \Delta^{d-1}. One challenging aspect of this algorithm is to evaluate the radial distribution of the Archimedean copula accurately and then to invert this distribution numerically. The radial distribution associated with the \textsf{Clayton} copula is given in Example 3.3 of the paper cited above, while those of the \textsf{Joe}, \textsf{AMH}, \textsf{Gumbel} and \textsf{Frank} copulae are given in Hofert, Mächler, and McNeil (2012). In general, one can use numerical inversion algorithms to compute the inverse of the radial distribution; however, this leads to spurious numerical errors. Other algorithms exist when the inverse generator is known to be the Laplace-Stieltjes transform, denoted \mathcal{LS}, of some positive random variable (see Marshall and Olkin (1988), Frees and Valdez (1998)). This positive random variable is often referred to as the frailty distribution. In this framework, Archimedean copulae allow the stochastic representation
\textbf{U} = \left( \varphi^\leftarrow (E_1/V), \dots, \varphi^\leftarrow(E_d /V)\right) \sim C,
with V \sim F = \mathcal{LS}^{-1}[\varphi^\leftarrow] the frailty and E_1, \dots, E_d i.i.d. standard exponential random variables independent of V. The algorithm to sample from this representation is thus:
In this framework, we define the \texttt{\_frailty\_sim} method inside the \textbf{Archimedean} class, which performs Algorithm 2. Each Archimedean copula with known frailty distribution is thus defined by its generator \varphi, its inverse \varphi^\leftarrow and the frailty distribution \mathcal{LS}^{-1}[\varphi^\leftarrow]. This is the case for \texttt{Joe}, \texttt{Clayton}, \texttt{AMH} and \texttt{Frank}.
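To illustrate the frailty construction (a standalone sketch, not the module's \texttt{\_frailty\_sim}), the Clayton copula can be sampled with a Gamma frailty: \varphi^\leftarrow(s) = (1+s)^{-1/\theta} is the Laplace-Stieltjes transform of a Gamma(1/\theta) random variable. Parameter values below are arbitrary.

```python
# Marshall-Olkin frailty sampling for the Clayton copula:
# V ~ Gamma(1/theta), E_j i.i.d. standard exponential, U_j = phi_inv(E_j/V).
import numpy as np

theta, d, n = 2.0, 3, 50_000
rng = np.random.default_rng(2)

V = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))   # frailty
E = rng.exponential(size=(n, d))                           # exponentials
U = (1.0 + E / V) ** (-1.0 / theta)                        # U = phi_inv(E/V) ~ C

# Sanity checks: uniform margins, and the empirical copula at a point
# matches the closed form C(u) = (sum(u_j**-theta) - d + 1)**(-1/theta).
u = np.array([0.5, 0.6, 0.7])
emp = np.mean(np.all(U <= u, axis=1))
closed = (np.sum(u ** -theta) - d + 1.0) ** (-1.0 / theta)
print(abs(U[:, 0].mean() - 0.5) < 0.01, abs(emp - closed) < 0.01)
```

This avoids any numerical inversion of the radial distribution, which is why the frailty route is preferred whenever the frailty is known.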
For the extreme value case, algorithms have been proposed by A. Stephenson (2003) (see Algorithms 2.1 and 2.2), who gives sampling methods for the Gumbel and the asymmetric logistic models. These algorithms are implemented in the COPPY module. Note that they are model-specific; thus the \texttt{sample\_unimargin} method is exceptionally located in the corresponding child of the multivariate Extreme class. Another procedure, designed by Dombry, Engelke, and Oesting (2016) to sample from multivariate extreme value models using extremal functions (see Algorithm 2 of the reference cited above), is also of prime interest. For the models implemented using this algorithm, namely the Hüsler-Reiss, t-EV, bilogistic and Dirichlet models, a method called \texttt{rExtFunc} located inside each class generates an observation from the corresponding law of the extremal function.
Samples from the Gaussian and Student copulae are directly given by Algorithms 5.9 and 5.10, respectively, of Alexander J. McNeil (2005). As each algorithm is model-specific, the \texttt{sample\_unimargin} method is located inside the Gaussian and Student classes.
We now present how to construct a multivariate Archimedean copula and generate random vectors from this model. Given the parameters of the copula, we use the following lines to construct our copula object:
d, theta, n_sample = 3, 2.0, 1024
copula = archimedean.Clayton(theta = theta, n_sample = n_sample, d = d)

We now call the \texttt{sample\_unimargin} method to obtain randomly generated vectors.
sample = copula.sample_unimargin()

We represent these realizations in three dimensions below.
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.scatter3D(sample[:,0], sample[:,1], sample[:,2], s = 5, edgecolor = seagreen(0.5), color = 'white')
ax.set_xlabel(r'$u_0$')
ax.set_ylabel(r'$u_1$')
ax.set_zlabel(r'$u_2$')
plt.show()

We now proceed to a case study where we use our module to assess, in a finite sample framework, the asymptotic properties of an estimator of the \lambda-madogram when data are missing completely at random (MCAR). This case study comes from the numerical results of Boulin et al. (2021). The \lambda-madogram belongs to a family of estimators, namely the madogram, which is of prime interest in environmental sciences, as it is designed to model pairwise dependence between maxima in space; see Bernard et al. (2013), Bador et al. (2015) and Saunders, Stephenson, and Karoly (2021), where the madogram was used as a dissimilarity measure to perform clustering. While in several fields, for example econometrics (Wooldridge (2007)) or survey theory (Boistard, Chauvet, and Haziza (2016)), the MCAR hypothesis appears to be a strong one, it is more realistic in environmental research, as the missingness of one observation is usually due to instrument, communication and processing errors that may reasonably be supposed independent of the quantity of interest. In Section 4.1, we define the objects and properties of interest, while in Section 4.2 we give a detailed tutorial in \textsf{Python} with the COPPY module to compare the asymptotic variance with an empirical counterpart of the \lambda-madogram with \lambda = 0.5.
It was emphasized above that the possible dependence between maxima can be described with an extreme value copula. This function is completely characterized by the Pickands dependence function (see Equation 4), where the latter is equivalent to the \lambda-madogram introduced by Naveau et al. (2009) and defined as
\nu(\lambda) = \frac{1}{2}\mathbb{E}\left[ \left|\{F_0(X_0)\}^{1/\lambda} - \{F_1(X_1)\}^{1/(1-\lambda)} \right|\right], \qquad(10)
with \lambda \in (0,1), and if \lambda = 0 and 0<u<1, then u^{1/\lambda} = 0 by convention. The latter quantity takes its inspiration from an extensively used geostatistical tool, namely the variogram (see Chapter 1.3 of Gaetan and Guyon (2008) for a definition and some classical properties). The \lambda-madogram can be interpreted as the L_1-distance between the uniform margins raised to the inverse of the corresponding weights \lambda and 1-\lambda. This quantity describes the dependence structure between extremes through its relation with the Pickands dependence function: if we suppose that C is an extreme value copula as in Equation 3, we have
A(\lambda) = \frac{\nu(\lambda) + c(\lambda)}{1-\nu(\lambda) - c(\lambda)}, \qquad(11)
with c(\lambda) = 2^{-1} \left( \lambda / (1+\lambda) + (1-\lambda)/(2-\lambda) \right) (see Proposition 3 of Marcon et al. (2017) for details).
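As a quick numerical sanity check of this relation (a sketch under Naveau et al.'s convention, where \nu carries the factor 1/2 and c(\lambda) = 2^{-1}(\lambda/(1+\lambda) + (1-\lambda)/(2-\lambda))), the independence copula, an extreme value copula with A \equiv 1, can be used:

```python
# Empirical lambda-madogram at lambda = 1/2 for independent uniforms,
# then recovery of A(1/2) = 1 through the madogram / Pickands relation.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
U = rng.uniform(size=(n, 2))        # independence copula: C(u) = u0 * u1

lam = 0.5
nu_hat = 0.5 * np.mean(np.abs(U[:, 0] ** (1 / lam) - U[:, 1] ** (1 / (1 - lam))))
# For independence, nu(1/2) = 1/6 in closed form (with the 1/2 convention).
print(abs(nu_hat - 1.0 / 6.0) < 0.005)

# Recover the Pickands function: A(1/2) should be 1 for independence.
c = 0.5 * (lam / (1 + lam) + (1 - lam) / (2 - lam))
A_hat = (nu_hat + c) / (1.0 - nu_hat - c)
print(abs(A_hat - 1.0) < 0.05)
```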
We consider independent and identically distributed (i.i.d.) copies \textbf{X}_1, \dots, \textbf{X}_n of \textbf{X}. In the presence of missing data, we do not observe a complete vector \textbf{X}_i for i \in \{1,\dots,n\}. We introduce \textbf{I}_i \in \{0,1\}^2, which satisfies, for all j \in \{0,1\}, I_{i,j} = 0 if X_{i,j} is not observed. To formalize incomplete observations, we introduce the incomplete vector \tilde{\textbf{X}}_i with values in the product space \bigotimes_{j=1}^2 (\mathbb{R} \cup \{\textsf{NA}\}) such that
\tilde{X}_{i,j} = X_{i,j} I_{i,j} + \textsf{NA} (1-I_{i,j}), \quad i \in \{1,\dots,n\}, \, j \in \{0,\dots, d-1\}.
We thus suppose that we observe the 4-tuples
(\textbf{I}_i, \tilde{\textbf{X}}_i), \quad i \in \{1,\dots,n\}, \qquad(12)
where for all i \in \{1, \dots,n \}, the \textbf{I}_{i} are copies of \textbf{I} = (I_0, I_1), with I_j distributed according to a Bernoulli random variable \mathcal{B}(p_j), p_j = \mathbb{P}(I_j = 1), for j \in \{0,1\}. We denote by p the probability of completely observing a realization of \textbf{X}, that is, p = \mathbb{P}(I_0=1, I_1 = 1). In Boulin et al. (2021), hybrid and corrected estimators, denoted \hat{\nu}_n^{\mathcal{H}} and \hat{\nu}_n^{\mathcal{H*}} respectively, are proposed to estimate the \lambda-madogram nonparametrically in the presence of data missing completely at random. Furthermore, a closed expression of their asymptotic variances for \lambda \in ]0,1[ is also given. This result is summarized in the following proposition.
Theorem 1 (Boulin et al. (2021)) Let (\textbf{I}_i, \tilde{\textbf{X}_i})_{i=1}^n be a sample given by Equation 12. For \lambda \in ]0,1[, if C is an extreme value copula with Pickands dependence function A, we have as n \rightarrow \infty \begin{align*} &\sqrt{n} \left(\hat{\nu}_n^{\mathcal{H}}(\lambda) - \nu( \lambda)\right) \overset{d}{\rightarrow} \mathcal{N}\left(0, \mathcal{S}^{\mathcal{H}}(p_1,p_2,p, \lambda)\right), \\ &\sqrt{n} \left(\hat{\nu}_n^{\mathcal{H}*}(\lambda) - \nu( \lambda)\right) \overset{d}{\rightarrow} \mathcal{N}\left(0, \mathcal{S}^{\mathcal{H}*}(p_1,p_2,p, \lambda)\right), \end{align*}
where \mathcal{S}^{\mathcal{H}}(p_1,p_2,p, \lambda) and \mathcal{S}^{\mathcal{H}*}(p_1,p_2,p, \lambda) are the asymptotic variances of the respective estimators.
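Before moving to the tutorial, the MCAR observation scheme of Equation 12 can be sketched with plain NumPy. The sketch uses independent indicators (unlike the Joe-copula mechanism used later in this section), and all values are illustrative:

```python
# MCAR masking as in Equation 12: each coordinate X_{i,j} is observed
# with probability p_j and replaced by NaN (standing in for NA) otherwise,
# independently of X.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
p = np.array([0.9, 0.8])                          # p_0, p_1

X = rng.normal(size=(n, 2))                       # any bivariate sample
I = (rng.uniform(size=(n, 2)) <= p).astype(int)   # I_j ~ Bernoulli(p_j)
X_tilde = np.where(I == 1, X, np.nan)             # tilde X = X*I + NA*(1-I)

# With independent indicators, p = P(I_0 = 1, I_1 = 1) = p_0 * p_1.
p_complete = np.mean((I[:, 0] == 1) & (I[:, 1] == 1))
print(abs(p_complete - p[0] * p[1]) < 0.02)
```

With dependent indicators, as in the Joe-copula mechanism below, p instead equals C_{(V_0,V_1)}(p_0, p_1).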
Benefiting from data generation with COPPY, we are able, via Monte Carlo simulation, to assess the theoretical results given by Theorem 1 in a finite sample setting. For that purpose, we implement a \texttt{Monte\_Carlo} class (in the \texttt{monte\_carlo.py} file) which contains methods to perform Monte Carlo iterations for a given extreme value copula. Before going any further, we have to import the necessary libraries:
from coppy.rng import base, evd, archimedean, monte_carlo
from scipy.stats import norm, expon
def gauss_function(x, x0, sigma):
    return np.sqrt(1 / (2 * np.pi * sigma ** 2)) * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

We then set up the parameters to simulate our bivariate dataset. For this subsection, we choose the asymmetric negative logistic model (see Section 6.2 for a definition) with parameters \theta = 10, \psi_1 = 0.1, \psi_2 = 1.0.
np.random.seed(42)
n_sample = 2048
theta, psi1, psi2 = 10, 0.1, 1.0

We choose the standard normal and the standard exponential distributions as margins. To simulate this sample, the following lines should be typed:
copula = evd.Asy_neg_log(theta = theta, psi1 = psi1, psi2 = psi2,
n_sample = n_sample)
sample = copula.sample(inv_cdf = [norm.ppf, expon.ppf])

The 2048 \times 2 array contains 2048 realizations of the asymmetric negative logistic model, where the first column is distributed according to a standard normal random variable and the second column according to a standard exponential. This distribution is depicted below. To obtain it, one needs the following lines:
seagreen = sns.light_palette("seagreen", as_cmap = True)
fig, ax = plt.subplots()
ax.scatter(sample[:,0], sample[:,1], color = seagreen(0.25),s = 1, marker = '.')
ax.set_xlabel(r'$x_0$')
ax.set_ylabel(r'$x_1$')
plt.show()

Before going into further details, we present the missing mechanism. Let V_0 and V_1 be random variables uniformly distributed on the ]0,1[ segment with copula C_{(V_0,V_1)}. We set I_0 = 1_{\{V_0 \leq p_0\}} and I_1 = 1_{\{V_1 \leq p_1\}}. It is thus immediate that I_0 \sim \mathcal{B}(p_0), I_1 \sim \mathcal{B}(p_1) and p \triangleq \mathbb{P}\{I_0 = 1, I_1 =1 \} = C_{(V_0,V_1)}(p_0, p_1). For our illustration, we take C_{(V_0,V_1)} to be a \texttt{Joe} copula with parameter \tau = 2.0 (we refer to Section 6.1 of the appendix for a representation of this copula). Thereby, if X_0 is not observed, it is more likely that X_1 is not observed either. A contrario, observing X_0 does not influence the observation of the other random variable X_1. Indeed, for this copula, a realization v_0 \geq 0.8 from V_0 is more likely when v_1 \geq 0.8 from V_1; if we observe v_1 < 0.8, the realization v_0 is close to independent of v_1. In climate studies, extreme events can damage the recording instruments in the surrounding regions where they occur, so the missingness of one variable may depend on others. We initialize the copula C_{(V_0,V_1)} with the following line:
copula_miss = archimedean.Joe(theta = 2.0, n_sample = n_sample)

For a given \lambda \in ]0,1[, we now want to estimate the \lambda-madogram from a sample simulated from the asymmetric negative logistic model, where some observations are missing due to the mechanism described above. We replicate this step several times to compute an empirical counterpart of the asymptotic variance. The \texttt{Monte\_Carlo} object has been designed in this way: we specify the number of iterations n_{iter} (take n_{iter} = 1024), the chosen extreme value copula (the asymmetric negative logistic model), the missing mechanism (described by C_{(V_0,V_1)} and p_0 = p_1 = 0.9) and \lambda (denoted \texttt{w}). We thus write the following lines:
u = np.array([0.9, 0.9])
n_iter, P, w = 1024, [[u[0], copula_miss._C(u)], [copula_miss._C(u), u[1]]], np.array([0.5,0.5])
monte = monte_carlo.Monte_Carlo(n_iter = n_iter, n_sample = n_sample,
copula = copula, copula_miss = copula_miss, w = w, P = P)

The \texttt{Monte\_Carlo} object is thus initialized with all the parameters needed. We may use the \texttt{finite\_sample} method to generate a \texttt{DataFrame} (a \texttt{Pandas} object) composed of 1024 rows and 3 columns. Each row contains an estimate of the \lambda-madogram, \hat{\nu}_n^{\mathcal{H}*} in Theorem 1 (\texttt{wmado}), the sample length n (\texttt{n}) and the normalized estimation error (\texttt{scaled}). We thus call this method:
df_wmado = monte.finite_sample(inv_cdf = [norm.ppf, expon.ppf], corr = True)
print(df_wmado.head())
      wmado       n    scaled
0 0.150451 2048.0 -0.153656
1 0.147472 2048.0 -0.288454
2 0.158530 2048.0 0.211972
3 0.153668 2048.0 -0.008084
4 0.153695 2048.0 -0.006827
where \texttt{corr = True} specifies that we compute the corrected estimator \hat{\nu}_n^{\mathcal{H}*} of Theorem 1. Now, using the \texttt{var\_mado} method defined in the \texttt{Extreme} class, we obtain the asymptotic variance for the given model and the parameters of the missing mechanism. We obtain this quantity as follows:
var_mado = copula.var_mado(w, p = copula_miss._C(u), P = P, corr = True)
print(var_mado)
print(df_wmado['scaled'].var())
0.015417245591834503
0.01717151507670374
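As a sanity check on the missing mechanism, the joint observation probability p = C_{(V_0,V_1)}(p_0, p_1) admits a closed form for the Joe copula, C(u,v) = 1 - \left( (1-u)^\theta + (1-v)^\theta - (1-u)^\theta (1-v)^\theta \right)^{1/\theta}, so it can be evaluated without the library. A standalone sketch (independent of COPPY's internal \texttt{\_C} method):

```python
def joe_cdf(u, v, theta):
    """Joe copula C(u, v) = 1 - ((1-u)^t + (1-v)^t - (1-u)^t (1-v)^t)^(1/t)."""
    a, b = (1 - u) ** theta, (1 - v) ** theta
    return 1 - (a + b - a * b) ** (1 / theta)

# Joint probability that both variables are observed, p0 = p1 = 0.9, theta = 2.
p = joe_cdf(0.9, 0.9, 2.0)
print(p)  # ~0.8589, above the 0.81 obtained under independent missingness
```

The value exceeds p_0 p_1 = 0.81, reflecting the positive dependence between the missingness indicators I_0 and I_1.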
We propose here to check numerically the asymptotic normality, with variance \mathcal{S}^{\mathcal{H}*}, of the normalized estimation error of the corrected estimator. We have all the data in hand and the asymptotic variance was computed above. We thus write:
sigma = np.sqrt(var_mado)
x = np.linspace(min(df_wmado['scaled']), max(df_wmado['scaled']), 1000)
# Gaussian density with mean mu and standard deviation sigma,
# used to overlay the limit distribution on the histogram.
def gauss_function(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
gauss = gauss_function(x, 0, sigma)
sns.displot(data = df_wmado, x = "scaled", color = seagreen(0.5),
            kind = 'hist', stat = 'density', common_norm = False,
            alpha = 0.5, fill = True, linewidth = 1.5)
plt.plot(x, gauss, color = 'darkblue')

This article presents the construction and some implementations of the \textsf{Python} module COPPY for random copula sampling. This is a seminal work of software implementation of copula modeling in \textsf{Python}, and much more remains to be done. We hope that the diffusion of the software to everyone who needs it may bring further implementations for multivariate modeling with copulae in \textsf{Python}. For example, choosing a copula to fit the data is an important but difficult problem. A robust approach to estimate copulae has been investigated recently by Alquier et al. (2020) using Maximum Mean Discrepancy. In connection with our example, semiparametric estimation of copulae with missing data could be of great interest, as proposed by Hamori, Motegi, and Zhang (2019).
Also, implementing the algorithm proposed by McNeil and Nešlehová (2009) to generate random vectors for Archimedean copulae has been tackled but, as expected, the numerical inversion gives spurious results, especially when the parameter \theta and the dimension d are large. Furthermore, as the support of the radial distribution is contained in the real line, the numerical inversion induces a longer computation time. Further investigations are thus needed in order to generate random vectors from classical Archimedean models using the radial distribution.
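For copulae whose generator is the Laplace transform of a known frailty distribution, the classical Marshall and Olkin (1988) algorithm sidesteps the inversion of the radial distribution entirely. A minimal sketch for the Clayton copula (a standalone illustration, not COPPY's implementation):

```python
import numpy as np

def clayton_sample(n, d, theta, rng=None):
    """Sample n points from a d-dimensional Clayton copula (theta > 0)
    via the Marshall-Olkin frailty algorithm."""
    rng = np.random.default_rng(rng)
    # Frailty V ~ Gamma(1/theta, 1): its Laplace transform is the
    # inverse Clayton generator psi(t) = (1 + t)^(-1/theta).
    v = rng.gamma(1.0 / theta, 1.0, size=(n, 1))
    e = rng.exponential(1.0, size=(n, d))
    # U_j = psi(E_j / V) has uniform margins and Clayton dependence.
    return (1.0 + e / v) ** (-1.0 / theta)

u = clayton_sample(5000, 3, theta=2.0, rng=1)
```

The margins are uniform and the pairwise Kendall's tau equals \theta/(\theta+2), which can be verified empirically on the generated sample.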
A direction of improvement for the COPPY module is dependence modeling with Vine copulae, which have recently become a tool of high interest in the machine learning community, see, *e.g.*, Lopez-Paz, Hernández-Lobato, and Zoubin (2013), Veeramachaneni, Cuesta-Infante, and O’Reilly (2015), Carrera, Santana, and Lozano (2016), Gonçalves, Von Zuben, and Banerjee (2016) or Sun, Cuesta-Infante, and Veeramachaneni (2019). This strengthens the need for dependence modeling with copulae in \textsf{Python}, as a non-negligible part of the machine learning community uses this language. In connection with this article, Vine copulae might be interesting to model dependencies between extremes, as suggested by Simpson, Wadsworth, and Tawn (2021) and Nolde and Wadsworth (2021). Last, other copula models could be implemented to capture further dependencies. These implementations will push forward the scope of dependence modeling with \textsf{Python} and give high-quality, usable tools for everyone in need.
Before giving the main details, we introduce some notation. Let B be the set of all nonempty subsets of \{1,\dots,d\} and B_1 = \{b \in B, |b| = 1\}, where |b| denotes the number of elements in the set b. We denote B_{(j)} = \{b \in B, j \in b\}. For d=3, the Pickands dependence function is expressed as
\begin{align*} A(\textbf{w}) =& \alpha_1 w_1 + \psi_1 w_2 + \phi_1 w_3 + \left( (\alpha_2 w_1)^{\theta_1} + (\psi_2w_2)^{\theta_1} \right)^{1/\theta_1} + \left( (\alpha_3 w_2)^{\theta_2} + (\phi_2w_3)^{\theta_2} \right)^{1/\theta_2} \\ &+ \left( (\psi_3 w_2)^{\theta_3} + (\phi_3w_3)^{\theta_3} \right)^{1/\theta_3} + \left( (\alpha_4 w_1)^{\theta_4} + (\psi_4 w_2)^{\theta_4} + (\phi_4 w_3)^{\theta_4} \right)^{1/\theta_4}, \end{align*}
where \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_4), \boldsymbol{\psi} = (\psi_1, \dots, \psi_4), \boldsymbol{\phi} = (\phi_1, \dots, \phi_4) are all elements of \Delta^3. We take \boldsymbol{\alpha} = (0.4,0.3,0.1,0.2), \boldsymbol{\psi} = (0.1, 0.2, 0.4, 0.3), \boldsymbol{\phi} = (0.6,0.1,0.1,0.2) and \boldsymbol{\theta} = (\theta_1, \dots, \theta_4) = (0.6,0.5,0.8,0.3) as the dependence parameter.
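The constraint that \boldsymbol{\alpha}, \boldsymbol{\psi} and \boldsymbol{\phi} lie in \Delta^3 (nonnegative entries summing to one) can be checked directly on the values above; a standalone verification:

```python
import numpy as np

# Parameter vectors of the asymmetric model; each must lie in the simplex.
alpha = np.array([0.4, 0.3, 0.1, 0.2])
psi   = np.array([0.1, 0.2, 0.4, 0.3])
phi   = np.array([0.6, 0.1, 0.1, 0.2])

for vec in (alpha, psi, phi):
    # Membership in the unit simplex Delta^3.
    assert (vec >= 0).all() and np.isclose(vec.sum(), 1.0)
```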
The Dirichlet model is a mixture of m Dirichlet densities, that is h(\textbf{w}) = \sum_{k=1}^m \theta_k \frac{\Gamma(\sum_{j=1}^d \sigma_{kj})}{\Pi_{j=1}^d \Gamma(\sigma_{kj})} \Pi_{j=1}^d w_j^{\sigma_{kj}-1}, with \sum_{k=1}^m \theta_k = 1 and \sigma_{kj} > 0 for k \in \{1,\dots,m\} and j \in \{1, \dots, d\}. Let \mathcal{D}_k \subset [0, \infty)^{k \times k} denote the space of symmetric, strictly conditionally negative definite matrices, that is
\begin{align*} \mathcal{D}_{k} = \Big\{ \Gamma \in [0,\infty)^{k \times k} : a^\top \Gamma a < 0 \; \textrm{for all } a \in \mathbb{R}^{k} \setminus \{\textbf{0}\} \, \textrm{with } \sum_{j=1}^{k} a_j = 0, \\ \Gamma_{ii} = 0, \Gamma_{ij} = \Gamma_{ji}, \quad 1 \leq i,j\leq k \Big\}. \end{align*}
For any 2 \leq k \leq d, consider m = (m_1, \dots, m_k) with 1 \leq m_1 < \dots < m_k \leq d and define \Sigma^{(k)}_m = 2 \left( \Gamma_{m_i m_k} + \Gamma_{m_j m_k} - \Gamma_{m_i m_j} \right)_{i,j \neq k} \in [0,\infty)^{(k-1)\times(k-1)}. Furthermore, let S(\cdot \,|\, \Sigma^{(k)}_m) denote the survival function of a normal random vector with mean vector \textbf{0} and covariance matrix \Sigma^{(k)}_m. We now define h_{km}(\textbf{y}) = \int_{y_k}^\infty S\left( (y_i - z + 2\Gamma_{m_i m_k})_{i=1}^{k-1} \,\Big|\, \Sigma^{(k)}_m\right) e^{-z}dz for 2 \leq k \leq d. We denote by \sum_{(k)} the summation over all k-vectors m=(m_1,\dots,m_k) with 1\leq m_1 < \dots < m_k \leq d.
Let \textbf{X} \sim \textbf{E}_d(\boldsymbol{\mu}, \Sigma, \psi) be an elliptically distributed random vector with cumulative distribution function F and margins F_0, \dots, F_{d-1}. Then, the copula C of F is called an elliptical copula. We denote by \phi the standard normal distribution function and by \boldsymbol{\phi}_\Sigma the joint distribution function of \textbf{X} \sim \mathcal{N}_d(\textbf{0}, \Sigma), where \textbf{0} is the d-dimensional vector with all entries equal to 0. In the same way, we denote by t_{\theta} the distribution function of a standard univariate t distribution with \theta degrees of freedom and by \boldsymbol{t}_{\theta, \Sigma} the joint distribution function of the vector \textbf{X} \sim \boldsymbol{t}_{d}(\theta, \textbf{0}, \Sigma). A d \times d matrix \Sigma is said to be positive semi-definite if for all u \in \mathbb{R}^d we have:
u^\top \Sigma u \geq 0.
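As an illustration of these definitions, positive semi-definiteness can be checked numerically through the eigenvalues of \Sigma, and a sample from the Gaussian copula is obtained by mapping a multivariate normal sample through \phi margin by margin. A minimal sketch with NumPy and SciPy (standalone, not COPPY's implementation):

```python
import numpy as np
from scipy.stats import norm

def is_positive_semidefinite(sigma, tol=1e-10):
    """Check u' Sigma u >= 0 for all u via the eigenvalues of Sigma."""
    return bool(np.all(np.linalg.eigvalsh(sigma) >= -tol))

def gaussian_copula_sample(n, sigma, rng=None):
    """Draw n points from the Gaussian copula with correlation matrix sigma:
    Z ~ N(0, sigma), then U_j = phi(Z_j) has uniform margins."""
    rng = np.random.default_rng(rng)
    z = rng.multivariate_normal(np.zeros(len(sigma)), sigma, size=n)
    return norm.cdf(z)

sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
u = gaussian_copula_sample(2000, sigma, rng=0)
```

The dependence structure of u is that of the bivariate Gaussian copula with correlation 0.5, while each margin is uniform on ]0,1[.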